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Background: Many recent statistical applications involve inference under complex models, where it is 
computationally prohibitive to calculate likelihoods but possible to simulate data. Approximate Bayesian 
Computation (ABC) is devoted to these complex models because it bypasses evaluations of the likelihood 
function using comparisons between observed and simulated summary statistics. 

Results: We introduce the R abc package that implements several ABC algorithms for performing parameter 
estimation and model selection. In particular, the recently developed non-linear heteroscedastic regression 
methods for ABC are implemented. The abc package also includes a cross-validation tool for measuring the 
accuracy of ABC estimates, and to calculate the misclassification probabilities when performing model selection. 
The main functions are accompanied by appropriate summary and plotting tools. Considering an example of 
demographic inference with population genetics data, we show the potential of the R package. 
Conclusions: R is already widely used in bioinformatics and several fields of biology. The R abc package will 
make the ABC algorithms available to the large number of R users, abc is a freely available R package under the 
GPL license, and it can be downloaded at |http:/ /cran.r- project. org/web/packages/abc/index.html| 
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Background 

In recent years, Approximate Bayesian computation (ABC) has become a popular method for parameter 
inference and model selection under complex models, where the evaluation of the likelihood function is 
computationally prohibitive. ABC bypasses exact likelihood calculations via the use of summary statistics 
and simulations, which, in turn, allows the consideration of highly complex models. ABC was first 
proposed in population genetics to do inference under coalescent models [l][2], but it is now increasingly 
applied in other fields, such as ecology or systems biology (see for reviews of ABC methods and 
applications). Software implementations of ABC dedicated to particular problems have already been 
developed in these fields [6|{T7) (See additional file : table giving the main features of the ABC software). 



The integration of ABC in a software package poses several challenges. First, data simulation, which is in 
the core of any ABC analysis, is specific to the model in question. Thus, many existing ABC software are 
specific to a particular class of models [7||9 15 or even to the estimation of a particular parameter [s]. 



Further, model comparison is an integral part of any Bayesian analysis, thus it is essential to provide 
software, where users are able to fit different models to their data. Second, an ABC analysis often follows 
trial-error approaches, where users try different models, ABC algorithms, or summary statistics. Therefore, 
it is important that users can run different analyses using batch files, which contain each analysis as a 
sequence of commands. Third, ABC is subject to intensive research and many new algorithms have been 



published in the past few years 18 -21 . Thus, an ABC software should be flexible enough to accommodate 
new developments in the field. 



Here, we introduce a generalist R package, abc, which aims to address the above challenges 22 . The price 
to pay for the generality and flexibility is that the simulation of data and the calculation of summary 
statistics are left to the users. However, simulation software might be called from an R session, which opens 
up the possibility for a highly interactive ABC analysis. For coalescent models, for instance, users can 

. R 



apply one of the many existing software for simulating genetic data such as ms 23 or simcoal2 24 
provides many advantages in the context of ABC: (i) R already possesses the necessary tools to handle, 
analyze and visualize large data sets, (ii) sequences of R commands can be saved in a script file, (iii) R is a 
free and collaborative project, thus new algorithms can be easily integrated to the package (e.g. via 
contributions from their authors). 
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Implementation 

The main steps of an ABC analysis follow the general scheme of any Bayesian analysis: formulating a 
model, fitting the model to data (parameter estimation), and improving the model by checking its fit 
(posterior predictive checks) and comparing it to other models [5|[25]. abc provides functions for the 
inference and model comparison steps, and we indicate how generic tools of the R-base package can be 
used for model checking. 

In order to use the package the following data should be prepared for R: a vector of the observed summary 
statistics, a matrix of the simulated summary statistics, where each row corresponds to a simulation and 
each column corresponds to a summary statistic, and finally, matrix of the simulated parameter values, 
where each row corresponds to a simulation and each column corresponds to a parameter. 

Parameter inference 

For the sake of clarity, we recall the general scheme of parameter estimation with ABC. Suppose that we 
want to compute the posterior probability distribution of a univariate or multivariate parameter, 9. A 
parameter value 9i, is sampled from its prior distribution to simulate a dataset yi, for i = 1, . . . ,n where n 
is the number of simulations. A set of summary statistics S{yi) is computed from the simulated data and 
compared to the summary statistics obtained from the actual data S{yo) using a distance measure d. If the 
distance d{S (yi) , S (yo)) is less than a given threshold, the parameter value 9i is accepted. The accepted 
0,;'s form a sample from an approximation of the posterior distribution, which can be improved by the use 
of regression techniques (see below). 

The function abc implements three ABC algorithms for constructing the posterior distribution from the 
accepted S^'s: a rejection method, and regression-based correction methods that use either local linear 
regression jlj or neural networks 26|. When the rejection method ("rejection") is selected, the accepted 
Sj's are considered as a sample from the posterior distribution |2 . The two regression methods 
("loclinear" and "neuralnet") implement an additional step to correct for the imperfect match between 
the accepted, S{yi), and observed summary statistics, S{yo), using the following regression equation in the 
vicinity of S{yQ) 

9i = m{S{yi)) + Ci, 

where m is the regression function, and the e^'s are centered random variables with a common variance. 
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Simulations that closely match S{yo) are given more weight by assigning to each simulation {9i, S{yi)) the 
weight K[d{S {yi) , S {yo))], where K is a statistical kernel. The local linear model ("loclinear") assumes a 
linear function for m, while neural networks account for the non-linearity of m and allow users to reduce 
the dimension of the set of summary statistics. Once the regression is performed, a weighted sample from 
the posterior distribution is obtained by correcting the ^,;'s as follows, 

0* = m{S{yo)) + 

where m(-) is the estimated conditional mean and the eVs are the empirical residuals of the regression [l]. 
Additionally, a correction for heteroscedasticity is applied, by default, in abc, 

e.^ = m{S{yo)) + ^, , 

<^{S{yi)) 

where a{-) is the estimated conditional standard deviation [26] . 

The function abc returns an object of class "abc" that can be printed, summarized and plotted using the 
S3 methods of the R generic functions, print, summary, hist and plot. The function print returns a 
summary of the object. The function summary calculates summaries of the posterior distributions, such as 
the mode, mean, median, and credible intervals, taking into account the posterior weights, when 
appropriate. The hist function displays the histogram of the weighted posterior sample. The plot 
function generates various plots that allow the evaluation of the quality of estimation when one of the 
regression methods is used. The following plots are generated: a density plot of the prior distribution, a 
density plot of the posterior distribution estimated with and without regression-based correction, a scatter 
plot of the Euclidean distances as a function of the parameter values, and a normal Q-Q plot of the 
residuals from the regression. When the heteroscedastic regression model is used, a normal Q-Q plot of the 
standardized residuals is displayed. 

Finally, we note that alternative algorithms exist that sample from an updated distribution that is closer in 



shape to the posterior than to the prior 13 27 28 . However, we do not implement these methods in R abc 



because they require the repeated use of the simulation software. 
Cross-validation 

The function cv4abc performs a leave-one-out cross-validation to evaluate the accuracy of parameter 
estimates and the robustness of the estimates to the tolerance level. To perform cross-validation, the i*'' 
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simulation is randomly selected as a validation simulation, its summary statistic(s) S{yi) are used as 
"dummy" observed summary statistics, and its parameters are estimated via abc using all simulations 
except the i*'' simulation. Ideally, the process is repeated n times, where n is the number of simulations 
(so-called n-fold cross-validation). However, performing an n-fold cross-validation might take up too much 
time, so the cross-validation is often performed for a subset of typically 100 randomly selected simulations. 
The summary S3 method of cv4abc computes the prediction error as 

"""" ~ Var{e,) ' 

where 0i is the true parameter value of the z*'' simulated data set and 0i is the estimated parameter value. 
The plot function displays the estimated parameter values as a function of the true values. 



Model selection 

The function postpr implements model selection to estimate the posterior probability of a model M as 
Pr{M\S{yo)). Three different methods are implemented. With the rejection method ("rejection"), the 
posterior probability of a given model is approximated by the proportion of accepted simulations given this 
model. The two other methods are based on multinomial logistic regression ("mnlogistic") or neural 
networks ("neuralnet"). In these two approaches, the model indicator is treated as the response variable 



of a polychotomous regression, where the summary statistics are the independent variables 29 . Using 
neural networks can be efficient when highly dimensional statistics are used. Any of these methods are 
valid when the different models to be compared are, a priori, equally likely, and the same number of 
simulations are performed under each model. The postpr's S3 method for summary displays the posterior 
model probabilities, and calculates the ratios of model probabilities, the Bayes factor, for all possible pairs 



of models 30 



A further function, expected. devicince, is implemented to guide the model selection procedure. The 
function computes an approximate expected deviance from the posterior predictive distribution. Thus, in 
order to use the function, users have to re-use the simulation tool and to simulate data from the posterior 
parameter values. The method is particularly advantageous when it is used with one of the regression 



methods. Further details on the method can be found in 31 and fully worked out examples are provided 
in the package manual. 
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Model misclassification 

A cross-validation tool is available for model selection as well. The objective is to evaluate if model 
selection with ABC is able to distinguish between the proposed models by making use of the existing 
simulations. The summary statistics from one of the simulations are considered as pseudo-observed 
summary statistics and classified using all the remaining simulations. Then, one expects that a large 
posterior probability should be assigned to the model that generated the presudo-observed summary 
statistics. Two versions of the cross-validation are implemented. The first version is a "hard" model 
classification. We consider a given simulation as the pseudo-observed data assign it to the model for which 
postpr gives the highest posterior model probability. This procedure is repeated for a given number of 
simulations for each model. The results are summarized in a so-called confusion matrix [32| . Each row of 
the confusion matrix represents the number of simulations under a true model, while each column 
represents the number of simulations under a model assigned by postpr. If all simulations had been 
correctly classified, only the diagonal elements of the matrix are non-zero. The second version is called 
"soft" classification. Here we do not assign a simulation to the model with the highest posterior 
probability, but average the posterior probabilities over many simulations for a given model. This 
procedure is again summarized as a matrix, which is similar to the confusion matrix. However, the 
elements of the matrix do not give model counts, but the average posterior probabilities across simulations 
for a given model. The matrices can be visualized using a barplot using the S3 method for plot. 

Results and Discussion 

In following sections we show the utility of the abc package using an example of parameter inference in 
human demographic history. 

Background and data 

Several studies have found evidence that African human populations have been expanding, while human 
populations outside of Africa went through a population bottleneck and then expanded. We re-examined 
this problem by re-analyzing published data of 50 unlinked autosomal non-coding regions from a Hausa 
(Cameroon) population and a European (Italy) population |33 . Data were summarized using three 
summary statistics: the mean and the variance of Tajima's D and the mean heterozygosity (H) (Table 2 
in [33]). Both Tajima's D and H have been classically used to detect historical changes in population size. 
Here, we analyzed these data using abc to provide a walk-through analysis of the functions of the R 
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package. We reached conclusions in accordance with 33 , who analyzed these data with a different method 



(see details in |33 ). We found support for the hypothesis that the Hausa population is expanding whereas 
the Italian population has experienced a bottleneck. Additionally, we estimated the ancestral Italian 



population size to be around 10, 000 individuals, which is similar to the results of 33 



Model selection 

We considered three different models of demographic history: constant population size, exponential growth 
after a period of constant population size, and population bottleneck, where, after the bottleneck, the 
population recovered to its original size. All three models can be defined by a number of demographic 
parameters, such as population sizes, rates and timings of changes in population size, which we do not 
detail here. We simulated 50,000 data sets under each of the three demographic models using the software 



ms 23 , and calculated the three summary statistics in each model. 

Before applying the model selection function (postpr) on the real data, we performed a cross-validation for 
model selection (cv4postpr) to evaluate if ABC can, at all, distinguish between these three models. We 
performed 500 cross-validation simulations for each model. The following confusion matrix (using the S3 
method for summary) and the barplot (using the S3 method for plot, Figure 1) illustrate that ABC is able 
to distinguish between these three models, and, notably, it is the bottleneck model that can be classified 
correctly the most frequently: 396 times out of 500. 

Confusion matrix based on 500 samples for each model. 
$tol0.01 

bott const exp 
bott 396 97 7 
const 92 321 87 
exp 34 144 322 



Mean model posterior probabilities using the mnlogistic method. 
$tol0.01 

bott const exp 
bott 0.7202 0.2154 0.0644 
const 0.2204 0.5085 0.2711 
exp 0.0791 0.2960 0.6249 
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Next, we calculated the posterior probabilities of each demographic scenario using the rejection 
("rejection") and the multinomial logistic regression method ("mnlogistic") of the function postpr. 
Considering three different tolerance rates, 0.1%, 0.5%, and 1%, we found that the Hausa data best 
supported the model of exponential growth whereas the Italian data best supported the bottleneck model 
(see Table 1). Application of the summary method of postpr produced the following output for the Italian 
data: 

Call: 

postpr (target = tajima.obs$italy, index = models, sumstat = 

tajima.sim, tol = 0.1, method = "mnlogistic") 
Data: 

postpr . out$values (15000 posterior samples) 
Models a priori : 

bott , const , exp 
Models a posteriori: 

bott , const , exp 



Proportion of accepted simulations (rejection) : 

bott const exp 
0.873 0.113 0.014 



Bayes factors 
bott 

bott 1 . 000 
const 0.129 
exp 0.016 



const exp 
7.754 61.201 
1.000 7.893 
0.127 1.000 



Posterior model probabilities (mnlogistic) : 

bott const exp 
0.991 0.009 0.001 



8 



Bayes factors: 



bott 



const 



exp 



bott 



1.000 



113.154 1745.512 



const 



0.009 



1.000 



15.426 



exp 



0.001 



0.065 



1.000 



Posterior predictive checks 

To further confirm which model provided the best fit to the data, we considered posterior predictive 



task can be easily carried out using the simulation software (here ms) and R. Here, we illustrate how the 
posterior predictive checks were run for the Italian sample. 

First, we estimated the posterior distributions of the parameters of the three models using abc. Then, we 
sampled a set of 1,000 multivariate parameters from their posterior distribution. Last, we obtained a 
sample from the distribution of the three summary statistics a posteriori by simulating data sets with the 
1,000 sampled multivariate parameters using ms (see Figure 2). We found that the model of exponential 
growth was unable to account for the observed level of heterozygosity when accounting for the remaining 
two summary statistics in the Italian sample. The bottleneck and the constant-size population were, 
however, able to reproduce the observed values of the summary statistics. Such posterior predictive checks 
use the summary statistics twice; once for sampling from the posterior and once for comparing the 
marginal posterior predictive distributions to the observed values of the summary statistics. An alternative 
approach of model criticism use the summary statistics only once but is based on ABC-MCMC, which is 
not implemented in the package |34| . 

Parameter inference 

Next we inferred the parameters of the bottleneck model for the Italian data. The bottleneck model can be 
described with four parameters: the ancestral population size Na, the ratio of the population sizes before 
and during the bottleneck, the duration of the bottleneck, and the time since the beginning of the 
bottleneck. We show only the estimation of the ancestral population size Na- We first assessed if ABC was 
able to estimate the parameter Na at all. We used the function cv4abc to determine the accuracy of ABC 
and the sensitivity of estimates to the tolerance rate. Figure 3 shows the estimated values of Na as a 



checks [25J. Note that there is no specific function in abc for posterior predictive checks, nevertheless the 
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function of the true values for three distinct tolerance rates. The posterior distribution of Na was 
summarized with its median. The points of the cross-validation plot are scattered around the identity line 
indicating that Na can be well estimated using the three summary statistics. Further, estimates were not 
only accurate for A^^, but also insensitive to the tolerance rate (see Figure 3). Accordingly, the prediction 
error was found to be around 0.20 independently of the tolerance rate. 

When using abc with the method "neuralnet", we obtained the following summary of the posterior 
distribution using the function summary: 

Data: 

abc. out$adj .values (500 posterior samples) 
Weights : 
abc . out$weights 





Ancestral_Size 


Min. : 


7107. 


,099 


Weighted 2.5 7. Perc. : 


7604, 


,957 


Weighted Median: 


9951, 


,219 


Weighted Mean: 


10148, 


.146 


Weighted Mode: 


10004, 


.551 


Weighted 97.5 7. Perc; 


: 14284, 


.484 


Max. : 


16505, 


.348 



The function plot can be used as a diagnostic tool for the visuaUzation of the output of abc for the 
parameter A^^. Figure 4 shows that the posterior distribution is very different from the prior distribution, 
confirming that the three summary statistics convey information about the ancestral population size 
(Figure 4 lower left panel). The upper right panel of Figure 4 shows the distance between the simulated 
and observed summary statistics as a function of the prior values of Na- Again, this plot confirms that the 
summary statistics convey information about Na, because the distances corresponding to the accepted 
values are clustered and not spread around the prior range of Na- The lower right panel displays a 
standard Q-Q plot of the residuals of the regression, This plot serves as a regression diagnostic of 
locf it or nnet, when the method is "loclinear" or "neuralnet". 
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Conclusions 

We provide an R package abc to perform model selection and parameter estimation via Approximate 
Bayesian Computation. Integrating abc within the R statistical environment offers high quality graphics 
and data visualization tools. The R package implements recently developed non-linear methods for ABC, 
and is going to evolve as new algorithms and methods accumulate. 

Availability and requirements 

Project name: R abc 



Project home page: http: / /cran.r- project .org/web / packages /abc/ index.html 
Version: 1.3 

Operating system(s): Windows, Linux, MacOS 
Programming language: R 

Other requirements: R version >2.10 and the R packages: nnet, quantreg, locfit 
License: GNU GPL>3 

Any restrictions to use by non-academics: None 
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MB and KC implemented the algorithms and methods, and wrote the paper. KC created the R package. 
OF critically reviewed the paper and tested the package. 
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Figures 

Figure 1 - Model misclassification 

Graphical illustration of the confusion matrix for three demographic models. Each color from dark to light 
gray corresponds to a model (bottleneck, constant, exponential, respectively). If the simulations were 
perfectly classified, each bar would have a single color of its own corresponding model. 

Figure 2 - Posterior predictive checks 

Distribution of the three summary statistics a posteriori for the three demographic models. Vertical bars 
correspond to the values computed from the Italian data. 

Figure 3 - Cross-validation for parameter estimation 

Estimated vahics as a function of true parameter values of the ancestral population size, Na, under the 
bottleneck model. The plot has been generated by the function plot from an object of class "cv4abc". 
Point estimates are obtained using posterior median. 

Figure 4 - ABC regression diagnostics 

Diagnostic plot of an object of class "abc", generated by the function plot. The upper left panel shows the 
prior distribution. The lower left panel shows the posterior distribution obtained with and without the 
regression correction method, and the prior distribution, for reference. The upper right panel displays the 
distances between observed and simulated summary statistics as a function of the parameter values. Red 
points indicate the accepted values. The lower right panel is a Normal Q-Q plot of the residuals of the 
regression, here from nnet. 
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Tables 

Table 1 - Posterior probabilities of the different demographic models 



Population 


ivietnou 


Tolerance rate 


Constant 


Expansion 


Bottleneck 






0.001 


0.28 


0.61 


0.11 




Regression 


0.005 


0.35 


0.52 


0.13 


Hausa 




0.01 


0.34 


0.54 


0.12 


(Africa) 




0.001 


0.27 


0.58 


0.15 
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